[Kernel] Use pre-allocated output buffer for triton kernel fused_experts #29219
jeejeelee merged 4 commits into vllm-project:main on Nov 26, 2025
Purpose
This PR uses a pre-allocated output buffer for the Triton kernel matmul_ogs.
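The change follows the common pattern of letting the caller own the output allocation and having the kernel write into it, instead of allocating a fresh result on every call. The sketch below shows that pattern in plain Python under stated assumptions: matmul_into is a hypothetical toy stand-in for matmul_ogs (no Triton, no torch), and workspace is a hypothetical caller-owned buffer sized for the largest batch. Neither is a vLLM or triton_kernels API.

```python
def matmul_into(x, w, out):
    """Hypothetical toy kernel with an output argument: writes x @ w into out.

    x is M x K, w is K x N, and out must have at least M rows of length N.
    Rows of out beyond M are left untouched.
    """
    m, k, n = len(x), len(w), len(w[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += x[i][p] * w[p][j]
            out[i][j] = acc  # write in place; no new result matrix is allocated
    return out


# Allocate the output buffer once, sized for the largest expected batch (MAX_M rows).
MAX_M, N = 4, 2
workspace = [[0.0] * N for _ in range(MAX_M)]

w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3, N = 2

# First call: M = 2 tokens.
first = matmul_into([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], w, workspace)

# Second call: M = 1 token, reusing the same buffer.
second = matmul_into([[1.0, 1.0, 1.0]], w, workspace)
```

Both calls return the same workspace object, so no per-call allocation happens. Only the first M rows are valid after each call (row 1 above still holds data from the first call), so a caller would slice the buffer to M rows before using it.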
It also overrides the moe_problem_size() function in OAITritonExperts, because the superclass moe_problem_size expects N to be the second dimension of w1 (see here), whereas the Triton kernels expect N to be the third dimension of w1. Without the override, N is incorrectly assigned the value of K for Triton.
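The N/K mix-up comes down to which axis of w1 is read as N. Below is a minimal sketch with plain shape tuples, assuming the base-class convention is w1 = (E, N, K) and the Triton-kernel convention is w1 = (E, K, N), as described in the paragraph above. ProblemSize, problem_size_nk_layout and problem_size_kn_layout are hypothetical names, and the (E, M, N, K, topk) field order is an assumption rather than a copy of vLLM's code.

```python
from typing import NamedTuple


class ProblemSize(NamedTuple):
    E: int     # number of experts
    M: int     # number of tokens
    N: int     # output width of the first expert matmul (w1)
    K: int     # hidden size
    topk: int  # experts selected per token


def problem_size_nk_layout(a1_shape, w1_shape, topk):
    """Hypothetical base-class-style reading: w1 is (E, N, K), so N is dim 1."""
    E, N, _ = w1_shape
    M, K = a1_shape
    return ProblemSize(E, M, N, K, topk)


def problem_size_kn_layout(a1_shape, w1_shape, topk):
    """Hypothetical override: w1 is (E, K, N) for the Triton kernels, so N is dim 2."""
    E, _, N = w1_shape
    M, K = a1_shape
    return ProblemSize(E, M, N, K, topk)


a1 = (16, 2048)            # M = 16 tokens, K = 2048 hidden size
w1_triton = (8, 2048, 4096)  # Triton layout: E = 8, K = 2048, N = 4096

wrong = problem_size_nk_layout(a1, w1_triton, topk=2)  # reads K (2048) as N
right = problem_size_kn_layout(a1, w1_triton, topk=2)  # reads N = 4096
```

With the Triton layout, the dim-1 reading silently returns N == K, which matches the bug described above; reading dim 2 recovers the intended N.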
Test Result
Unit test passed
Accuracy Testing
Benchmark
Baseline:
PR:
cc @varun-sundar-rabindranath